Dataset loaded: 59,220 rows, 56 columns
Machine Learning Methods
Clustering and Predictive Modeling for Job Market Analysis
1 Introduction
This section applies machine learning techniques to uncover patterns in job market data, with a specific focus on Business Analytics, Data Science, and Machine Learning roles. For job seekers entering these competitive fields in 2024, understanding the hidden structure of job postings, predicting salary ranges, and identifying role characteristics can provide strategic advantages in career planning.
We employ three complementary machine learning approaches:
- K-Means Clustering: To discover natural groupings in BA/DS/ML job postings
- Regression Models: To predict salary ranges based on job characteristics
- Classification Models: To distinguish between different role types
2 Data Filtering for BA/DS/ML Analysis
To focus our analysis on relevant career paths for Business Analytics, Data Science, and Machine Learning professionals, we filter the dataset to include only positions matching these disciplines.
Filtered to BA/DS/ML jobs: 15,378 postings
Percentage of total dataset: 25.97%
Top 10 Job Titles:
TITLE_NAME                            count
Data Analysts                          6409
ERP Business Analysts                   369
Data Analytics Engineers                343
Data Analytics Interns                  328
Lead Data Analysts                      319
Data Analytics Analysts                 256
Master Data Analysts                    234
Business Intelligence Data Analysts     223
IT Data Analytics Analysts              221
SAP Business Analysts                   206
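The filtering step can be sketched as follows. This is a minimal illustration that assumes a case-insensitive keyword match on `TITLE_NAME`; the keyword pattern and the toy DataFrame are hypothetical, and the filter that produced the counts above may differ.

```python
import pandas as pd

# Toy stand-in for the postings table; only the TITLE_NAME column name
# comes from the report, the rows are made up for illustration.
jobs = pd.DataFrame({
    "TITLE_NAME": [
        "Data Analysts",
        "ERP Business Analysts",
        "Registered Nurses",
        "Machine Learning Engineers",
        "Truck Drivers",
    ]
})

# Hypothetical keyword pattern for BA/DS/ML titles (an assumption,
# not the original filter).
pattern = r"data anal|business anal|data scien|machine learning"

mask = jobs["TITLE_NAME"].str.contains(pattern, case=False, na=False, regex=True)
filtered = jobs[mask]

print(f"Filtered to BA/DS/ML jobs: {len(filtered):,} postings")
print(f"Percentage of total dataset: {len(filtered) / len(jobs):.2%}")
print(filtered["TITLE_NAME"].value_counts().head(10))
```

Passing `na=False` treats missing titles as non-matches instead of propagating NaN into the mask.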
3 Feature Engineering
Before applying machine learning algorithms, we need to prepare our features. We focus on quantitative measures that describe each posting: salary, required experience, posting duration, and remote status.
Feature Summary:
          AVG_SALARY  EXPERIENCE_YEARS  DURATION_DAYS     IS_REMOTE
count   15378.000000      15378.000000   15378.000000  15378.000000
mean    95154.236185          4.564053      20.494863      0.188256
std     24709.977582          2.199538      11.234962      0.390929
min     40000.000000          0.000000       0.000000      0.000000
25%     78307.982976          3.000000      15.000000      0.000000
50%     95015.535141          5.000000      18.000000      0.000000
75%    111787.084948          5.000000      23.000000      0.000000
max    193155.942661         15.000000      59.000000      1.000000
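The features in this summary sit on very different scales: salaries are in the tens of thousands while `IS_REMOTE` is 0 or 1. Distance-based methods such as K-Means are usually run on standardized inputs for that reason. The sketch below shows z-score standardization with NumPy on made-up rows; whether the original pipeline used this exact scaling is an assumption.

```python
import numpy as np

# Columns: AVG_SALARY, EXPERIENCE_YEARS, DURATION_DAYS, IS_REMOTE.
# The rows are illustrative values, not taken from the dataset.
X = np.array([
    [78000.0, 3.0, 15.0, 0.0],
    [95000.0, 5.0, 18.0, 0.0],
    [112000.0, 5.0, 23.0, 1.0],
    [140000.0, 8.0, 40.0, 0.0],
])

# Z-score standardization: subtract each column's mean and divide by its
# (population) standard deviation, so every feature has mean 0 and std 1.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.round(3))
```

After scaling, a one-standard-deviation difference in salary counts as much as a one-standard-deviation difference in experience when distances are computed.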
4 K-Means Clustering Analysis
Clustering helps us discover natural groupings in the job market. Different clusters might represent entry-level vs. senior positions, different specializations, or regional variations.
4.1 Elbow Method for Optimal K
Clustering dataset: 15,378 samples
Inertia values by K:
 K       Inertia
 2  46083.334533
 3  37098.092228
 4  30324.433241
 5  24516.626582
 6  22230.434222
 7  20302.311134
 8  18622.328855
 9  17146.137474
10  16041.649025
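To read the elbow from numbers instead of a plot, it can help to look at how much inertia falls with each added cluster. The snippet below does that arithmetic on the inertia values reported above; it is a reading aid, not part of the original analysis.

```python
# Inertia values as reported in the table above.
inertia = {
    2: 46083.334533,
    3: 37098.092228,
    4: 30324.433241,
    5: 24516.626582,
    6: 22230.434222,
    7: 20302.311134,
    8: 18622.328855,
    9: 17146.137474,
    10: 16041.649025,
}

ks = sorted(inertia)

# Absolute decrease in inertia when moving from K-1 to K clusters.
drops = {k: inertia[prev] - inertia[k] for prev, k in zip(ks, ks[1:])}

for k, drop in drops.items():
    relative = drop / inertia[k - 1]
    print(f"K={k - 1} -> K={k}: inertia falls by {drop:,.0f} ({relative:.1%})")
```

Inertia always decreases as K grows, so the useful signal is where the decrease starts to flatten. Picking that point is a judgment call; this report proceeds with K=4.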
4.2 Apply K-Means with Optimal K
Clustering complete with K=4
Cluster distribution:
Cluster
0 5057
1 5287
2 2282
3 2752
Name: count, dtype: int64
Cluster Characteristics:
         AVG_SALARY  AVG_SALARY  EXPERIENCE_YEARS  DURATION_DAYS  IS_REMOTE
Cluster        mean      median              mean           mean       mean
0         114309.83   111978.57              3.88          16.26       0.00
1          77114.70    78684.40              5.37          15.91       0.00
2          95016.68    94501.34              4.39          41.52       0.06
3          94725.13    95721.16              4.41          19.64       1.00
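A cluster profile like the one above can be produced by fitting K-Means on standardized features and then aggregating the original columns by cluster label. The sketch below assumes scikit-learn and uses synthetic data; the column names follow the report, while the data, `n_init`, and `random_state` are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in for the four clustering features (not the real data).
features = pd.DataFrame({
    "AVG_SALARY": rng.normal(95000, 25000, n).clip(40000, None),
    "EXPERIENCE_YEARS": rng.integers(0, 16, n).astype(float),
    "DURATION_DAYS": rng.integers(0, 60, n).astype(float),
    "IS_REMOTE": rng.binomial(1, 0.19, n).astype(float),
})

# Standardize first so no single feature dominates the distance metric.
X_scaled = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
features["Cluster"] = kmeans.fit_predict(X_scaled)

# Summarize each cluster on the original (unscaled) units.
profile = features.groupby("Cluster").agg(
    salary_mean=("AVG_SALARY", "mean"),
    salary_median=("AVG_SALARY", "median"),
    experience_mean=("EXPERIENCE_YEARS", "mean"),
    duration_mean=("DURATION_DAYS", "mean"),
    remote_share=("IS_REMOTE", "mean"),
).round(2)

print(features["Cluster"].value_counts().sort_index())
print(profile)
```

Profiling on the unscaled columns keeps the summary in interpretable units (dollars, years, days) even though the clustering itself ran on z-scores.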
4.3 PCA Visualization of Clusters
Variance explained:
PC1: 26.95%
PC2: 25.08%
Total: 52.03%
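The variance shares above come from projecting the clustering features onto two principal components. A minimal sketch, assuming scikit-learn's `PCA` on standardized inputs and using synthetic, uncorrelated data in place of the real features:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in: 300 rows, 4 independent features (not the real data).
X = rng.normal(size=(300, 4))

X_scaled = StandardScaler().fit_transform(X)

# Keep the first two components for a 2-D scatter plot of the clusters.
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)

ratios = pca.explained_variance_ratio_
print(f"PC1: {ratios[0]:.2%}")
print(f"PC2: {ratios[1]:.2%}")
print(f"Total: {ratios.sum():.2%}")
```

With four standardized features that are unrelated to each other, each component would carry roughly 25% of the variance. If the four clustering features were the PCA inputs, the reported 26.95% and 25.08% suggest they are only weakly correlated, so a 2-D plot discards close to half of the variation.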
5 Regression Analysis: Salary Prediction
Understanding what factors drive salary differences can help job seekers negotiate better compensation and target high-paying opportunities.
5.1 Data Preparation for Regression
Regression dataset: 15,378 samples, 13 features
Salary range: $40,000 - $193,156
Median salary: $95,016
5.2 Multiple Linear Regression
MULTIPLE LINEAR REGRESSION RESULTS
==================================================
RMSE: $24,491.21
MAE: $19,614.15
R² Score: -0.0011
The negative R² means the model fits slightly worse than always predicting the mean salary, so it explains none of the salary variance.
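A negative R² follows directly from the definition R² = 1 - SS_res / SS_tot, where SS_res is the model's squared error and SS_tot is the squared error of predicting the mean of the evaluated targets. Whenever a model's errors are larger than that mean-only baseline, the ratio exceeds 1 and R² drops below zero. The toy numbers below are made up to show the effect.

```python
def r2_score_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative salaries (not from the dataset); their mean is 90,000.
y_true = [60000, 80000, 100000, 120000]

# Predicting the mean everywhere gives R² of exactly zero.
mean_prediction = [90000] * len(y_true)

# A constant prediction that misses the mean does worse than the baseline.
off_prediction = [95000] * len(y_true)

r2_mean = r2_score_manual(y_true, mean_prediction)
r2_off = r2_score_manual(y_true, off_prediction)

print(f"R² when predicting the mean: {r2_mean:.4f}")
print(f"R² when predicting 95,000:   {r2_off:.4f}")
```

An R² near zero, as reported here, therefore says the features carry almost no linear signal about salary in the evaluated data.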
5.3 Random Forest Regression
RANDOM FOREST REGRESSION RESULTS
==================================================
RMSE: $24,827.40
MAE: $19,926.56
R² Score: -0.0288
The negative R² means the model fits worse than always predicting the mean salary, so it explains none of the salary variance.
5.4 Regression Model Comparison
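One way to put the two regressors on equal footing is to score them on the same held-out split next to a trivial mean-only baseline. The sketch below assumes scikit-learn and uses a synthetic target that is unrelated to the features, which mimics a near-zero R² setting; the data, split ratio, and hyperparameters are illustrative and not the report's configuration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)

# Synthetic features and a target with no relationship to them.
X = rng.normal(size=(500, 5))
y = rng.normal(95000, 25000, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

models = {
    "Mean baseline": DummyRegressor(strategy="mean"),
    "Linear regression": LinearRegression(),
    "Random forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "rmse": float(np.sqrt(mean_squared_error(y_test, pred))),
        "mae": float(mean_absolute_error(y_test, pred)),
        "r2": float(r2_score(y_test, pred)),
    }

for name, m in results.items():
    print(f"{name:18s} RMSE={m['rmse']:>10,.2f}  MAE={m['mae']:>10,.2f}  R2={m['r2']:.4f}")
```

If neither model beats the baseline row, as the RMSE and R² values reported above suggest, the features are the limiting factor and swapping algorithms is unlikely to help.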
6 Classification: Role Type Prediction
Understanding the distinguishing characteristics of different role types can help job seekers tailor their applications and skill development.
6.1 Create Role Categories
Classification dataset: 14,143 samples
Role distribution:
ROLE_CATEGORY        count  percent
Data Analytics       11944    84.45
Business Analytics    1776    12.56
Data Science           419     2.96
Machine Learning         4     0.03
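Role categories of this kind are typically derived from job titles with keyword rules. The function below is a hypothetical sketch: the keywords, their priority order, and the `categorize_role` name are assumptions and not the rules used to build the table above. Titles that match no rule return `None`, which is one way the classification set (14,143 rows) can end up smaller than the filtered set (15,378 rows).

```python
def categorize_role(title):
    """Map a job title to a role category using hypothetical keyword rules.

    Rules are checked from most to least specific; the first match wins.
    """
    t = title.lower()
    if "machine learning" in t:
        return "Machine Learning"
    if "data scien" in t:
        return "Data Science"
    if "business anal" in t:
        return "Business Analytics"
    if "data anal" in t:
        return "Data Analytics"
    return None  # title does not fit any category


examples = [
    "Data Analysts",
    "ERP Business Analysts",
    "Senior Data Scientists",
    "Machine Learning Engineers",
    "Project Managers",
]

for title in examples:
    print(f"{title!r:32} -> {categorize_role(title)}")
```

Because the first matching rule wins, the order of the checks decides how ambiguous titles are labeled, which directly shapes the class balance seen downstream.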
6.2 Prepare Classification Features
Classification features: 14
Samples per class:
ROLE_CATEGORY
Data Analytics 11944
Business Analytics 1776
Data Science 419
Machine Learning 4
Name: count, dtype: int64
6.3 Logistic Regression Classification
LOGISTIC REGRESSION CLASSIFICATION
==================================================
Accuracy: 0.8407 (84.07%)
F1 Score (Weighted): 0.7750
Classification Report:
                    precision  recall  f1-score  support
Business Analytics       0.23    0.02      0.03      533
Data Analytics           0.85    0.99      0.91     3583
Data Science             0.00    0.00      0.00      126
Machine Learning         0.00    0.00      0.00        1

accuracy                                   0.84     4243
macro avg                0.27    0.25      0.24     4243
weighted avg             0.74    0.84      0.78     4243
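With classes this imbalanced, accuracy needs a reference point. The arithmetic below uses the support counts from the report above to compute the accuracy of a classifier that always predicts the largest class.

```python
# Test-set support per class, as listed in the classification report.
support = {
    "Business Analytics": 533,
    "Data Analytics": 3583,
    "Data Science": 126,
    "Machine Learning": 1,
}

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of always predicting the majority class.
baseline_accuracy = support[majority_class] / total

reported_accuracy = 0.8407  # logistic regression accuracy from the report

print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Reported accuracy: {reported_accuracy:.4f}")
print(f"Beats baseline: {reported_accuracy > baseline_accuracy}")
```

The logistic regression accuracy of 0.8407 falls just below this majority-class baseline, which matches the near-zero recall on every class other than Data Analytics.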
6.4 Random Forest Classification
RANDOM FOREST CLASSIFICATION
==================================================
Accuracy: 0.8562 (85.62%)
F1 Score (Weighted): 0.8151
Classification Report:
                    precision  recall  f1-score  support
Business Analytics       0.64    0.17      0.27      533
Data Analytics           0.86    0.99      0.92     3583
Data Science             0.89    0.06      0.12      126
Machine Learning         0.00    0.00      0.00        1

accuracy                                   0.86     4243
macro avg                0.60    0.31      0.33     4243
weighted avg             0.84    0.86      0.82     4243
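The gap between the macro and weighted averages comes from how each one treats class size. The snippet below recomputes both F1 averages from the rounded per-class values in the random forest report above, so the results can differ from the reported figures in the last digit.

```python
# (f1-score, support) per class, copied from the random forest report.
report = {
    "Business Analytics": (0.27, 533),
    "Data Analytics": (0.92, 3583),
    "Data Science": (0.12, 126),
    "Machine Learning": (0.00, 1),
}

# Macro average: every class counts equally, regardless of size.
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support.
total = sum(n for _, n in report.values())
weighted_f1 = sum(f1 * n for f1, n in report.values()) / total

print(f"Macro F1:    {macro_f1:.4f}")
print(f"Weighted F1: {weighted_f1:.4f}")
```

The weighted score is dominated by the Data Analytics class, which makes up most of the test set, while the macro score shows that the three smaller classes are still poorly recognized.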